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Abstract Models for which the likelihood function can 
be evaluated only up to a parameter-dependent un¬ 
known normalizing constant, such as Markov random 
field models, are used widely in computer science, stat¬ 
istical physics, spatial statistics, and network analysis. 
However, Bayesian analysis of these models using stand¬ 
ard Monte Carlo methods is not possible due to the in¬ 
tractability of their likelihood functions. Several meth¬ 
ods that permit exact, or close to exact, simulation from 
the posterior distribution have recently been developed. 
However, estimating the evidence and Bayes’ factors 
(BFs) for these models remains challenging in general. 
This paper describes new random weight importance 
sampling and sequential Monte Carlo methods for es¬ 
timating BFs that use simulation to circumvent the 
evaluation of the intractable likelihood, and compares 
them to existing methods. In some cases we observe 
an advantage in the use of biased weight estimates. An 
initial investigation into the theoretical and empirical 
properties of this class of methods is presented. Some 
support for the use of biased estimates is presented, but 
we advocate caution in the use of such estimates. 

Keywords approximate Bayesian computation • 
Bayes’ factors • importance sampling • marginal 
likelihood • Markov random field • partition function • 
sequential Monte Carlo 


R. G. Everitt • E. Rowing • M. Evdemon-Hogan 
Department of Mathematics and Statistics, University of 
Reading, UK. 

E-mail: r.g.everitt@reading.ac.uk 
A. M. Johansen 

Department of Statistics, University of Warwick, Coventry, 
CV4 7AL, UK. 

E-mail: a. m. j ohansen® war wick .ac.uk 


1 Introduction 


There has been much recent interest in performing 
Bayesian inference in models where the posterior is in¬ 
tractable and, in particular, we have the situation in 
which the posterior distribution 7r(d|y) oc p{d)f{y\0), 
cannot be evaluated pointwise. This intractability typ¬ 
ically occurs occurs due to the intractability of the like¬ 
lihood, i.e. f{y\d) cannot be evaluated pointwise. Ex¬ 
ample scenarios include: 

1. the use of big data sets, where f{y\0) consists of a 
product of a large number of terms; 

2. the existence of a large number of latent variables 

X, with f{y\0) known only as a high dimensional 
integral/(y16») = f{y,x\9)dx; 

3. when f{y\9) = being an in¬ 

tractable normalising constant (INC) for the tract¬ 
able term 'y{y\0) (e.g. when / factorises as a Markov 
random field); 

4. where it is possible to sample from f{-\0), but not 
to evaluate it, such as when the distribution of the 
data given 9 is modelled by a complex stochastic 
computer model. 


Each of these (overlapping) situations has been con¬ 
sidered in some detail in previous work and each has 
inspired different methodologies. 

In this paper we focus on the third case, in which 
the likelihood has an INC. This is an important prob¬ 
lem in its own right (Girolami et al (20131 refer to it as 
“one of the main challenges to methodology for compu¬ 
tational statistics currently”). There exist several com¬ 
peting methodologies for inference in this setting (see 
EverittI (20121). In particular, the “exact” approaches 


of Mpller et al (2006) and [Murray et al (20061 exploit 
the decomposition f{y\9) = ■z^liy\9), whereas “sim- 
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ulation based” methods such as approximate Bayesian 


computation (ABC) (Grelaud et al 20091 do not de¬ 


pend upon such a decomposition and can be applied 
more generally: to situation 1 in |Picc hini an d Form^ 
(2013); situations 2 and 3 (e.g.|Eventt (2012|) and situ¬ 
ation 4 (e.g. Wilkinson (2013)). 

This paper considers the problem of Bayesian model 
comparison in the presence of an INC. We explore both 
exact and simulation-based methods, and find that ele¬ 
ments of both approaches may also be more generally 
applicable. Specifically: 

— For exact methods we find that approximations are 
required to allow practical implementation, and this 
leads us to investigate the use of approximate weights 
in importance sampling (IS) and sequential Monte 
Carlo (SMC). We examine the use of both exact- 


approximate approaches (as in Fearnhead et al (2010)) 


and also ^‘inexact-approximate” methods, in which 
complete flexibility is allowed in the approximation 
of weights, at the cost of losing the exactness of 
the method. This work is a natural counterpart to 


Alquier et al (2015), which examines ananalogous 


question (concerning the acceptance probability) for 
Markov chain Monte Carlo (MCMC) algorithms. 
These generally applicable methods, “noisy MCMC” 


(Alquier et al 2015) and “noisy SMC” (this paper) 
have some potential to address situations 1-3. 

We provide some comparison of these inexact ap¬ 
proximations with simulation-based methods, includ¬ 


ing the “synthetic likelihood” (SL) of Wood (2010). 


In the applications considered here we hnd this to be 
a viable alternative to ABC. Our results are suggest¬ 
ive that this, and related methods, may find success 
in scenarios in which ABC is more usually applied. 
In the remainder of this section we briefly outline the 
problem of, and methods for, parameter inference in 
the presence of an INC. We then detail the problem 
of Bayesian model comparison in this context, before 
discussing methods for addressing it in the following 
two sections. 


1.1 Parameter inference 

In this section we consider the problem of simulating 
from7r(0|2/) (x p{d)j{y\d)/Z{9) using MCMC. This prob¬ 
lem has been well studied, and such models are termed 
doubly intractable because the acceptance probability 
in the Metropolis-Bastings (MB) algorithm 

qie\e*) Pie*) ^{y\e*) z{e) 


min < 1 


qie*\9) Pie) ^iy\e) zie*) 


( 1 ) 


cannot be evaluated due to the presence of the INC. We 
first review exact methods for simulating from such a 


target in sections l.l.l|l.l)3 before looking at simulation- 
based methods in sections [l.l.4l and B.1.51 The methods 
described here in the context of MCMC form the basis 
of the methods for evidence estimation developed in the 
remainder of the paper. 

1.1.1 Single and multiple auxiliary variable methods 


Mpller et al (2006) avoid the evaluation of the INC 


by augmenting the target distribution with an extra 
variable u that lies on the same space as y, and use an 
MB algorithm with target distribution 

Trie, u\y) oc quiu\e,y)fiy\e)pie), 

where is some (normalised) arbitrary distribution. 
As the MB proposal in (0, u)-space they use 


ie*,u*)^ fiu*\e*)qie*\e), 

giving an acceptance probability of 

qie\e*)pie*) -fiy\e*) quiu*\e*,y) 7(m|6») 


min < 1 


’ qie*\e) pie) jiy\e) jiu*\e*) quiu\e,y) 


Note that, by viewing quiu*\e*,y)/jiu*\e*) as an 
unbiased IS estimator of 1/Z(0*), this algorithm can 
be seen as an instance of the exact approximations de¬ 


scribed in Beaumont (2003) and Andrieu and Roberts 


(2009), where it is established that if an unbiased es¬ 


timator of a target density is used appropriately in an 
MB algorithm, the d-marginal of the invariant distri¬ 
bution of this chain is the target distribution of inter¬ 
est. This automatically suggests extensions to the sin¬ 
gle auxiliary variable (SAV) method described above, 
where M > I importance points are used, yielding: 


1 


m 


M ^ 7(m(™)|6>) 

m—1 ' ^ ' ' 


( 2 ) 


Andrieu and Vihola (2012) show that the reduced vari¬ 


ance of this estimator leads to a reduced asymptotic 
variance of estimators from the resultant Markov chain. 
The variance of the IS estimator is strongly dependent 
on an appropriate choice of IS target q„(-|0, j/), which 


should have lighter tails than /(-jd). Mpller et al (2006) 
suggest that a reasonable choice may be qui-\e,y) = 
/(•|0), where 9 is the maximum likelihood estimator 
of 9. Bowever, in practice qui-\e,y) can be difficult to 
choose well, particularly when y lies on a high dimen¬ 
sional space. Motivated by this, annealed importance 


sampling (AIS) (Neal 2001) can be used as an alter¬ 


native to IS, leading to the multiple auxiliary variable 


(MAV) method of Murray et al (2006). AIS makes use 
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of a sequence of K targets, which in Murray et al (20061 
are chosen to be 

fk{-\0,0,y) oc 

between f{-\0) and g„(-|6*, y). After the initial draw uk+i 
/(•|0), the auxiliary point is taken through a sequence of 
K MCMC moves which successively have target 
fk{-\d,d,y) tor k = K : 1. The resultant IS estimator is 
given by 


ZiO) 


1 

M 


lk{u^,^\\9,9,y) 
=ik=i 7fc-i(4-il^>^>2/) 


M K 

En 


(4) 


(see Zhou et al (20151 for an illustration of the potential 
improvement and some discussion); we do not pursue 
this here as it is not the focus of this work. 

1.1.2 Exchange algorithms 

An alternative approach to avoiding the ratio of INCs 
in equation 0 is given by [Murray et al| ( [2006| ), in which 
it is suggested to use the acceptance probability 


min < 1 


q{0\0*)p{e*) jjyie*) 'riu\e) 
qi0*\0) p{0) ^{y\0) -i{u\0*) 


where u ~ f{'\0*), motivated by the intuitive idea that 
'y{u\9)/^(u\0*) is a single point IS estimator of 
Z{0)IZ(0*). This method is shown to have the correct 
invariant distribution, as is the extension in which AIS 
is used in place of IS. A potential extension might seem 
to be using multiple importance points ^ 

f{-\0*) to obtain an estimator of Z{0)/Z{0*) that has 
a smaller variance, with the aim of improving the sta¬ 
tistical efficiency of estimators based on the resultant 
Markov chain. This scheme is shown to work well em- 


noisy kernel. It also describes additional noisy MCMC 
algorithms for approximately simulating from the pos¬ 
terior, based on Langevin dynamics. 

1.1.3 Russian Roulette and other approaches 


This estimator has a lower variance (although at a higher 
computational cost) than the corresponding IS estima¬ 
tor. We note that AIS can be viewed as a particular 
case of SMC without resampling and one might expect 
to obtain additional improvements at negligible cost by 
incorporating resampling steps within such algorithms 


Girolami et al (2013) use series-based approximations 
to intractable target distributions within the 
exact-approximation framework, where “Russian Roul¬ 
ette” methods from the physics literature are used to 
ensure the unbiasedness of truncations of infinite sums. 
These methods do not require exact simulation from 
f{-\0*), as do the SAV and exchange approaches de¬ 
scribed in the previous two sections. However, SAV and 
exchange are often implemented in practice by generat¬ 
ing the auxiliary variables by taking the Hnal point of a 
long “internal” MCMC run in place of exact simulation 


(e.g Caimo and Friel (2011)). For Hnite runs of the in¬ 


ternal MCMC, this approach will not have exactly the 


desired invariant distribution, but Everitt (2012) shows 


that under regularity conditions the bias introduced by 
this approximation tends to zero as the run length of 
the internal MCMC increases: the same proof holds for 
the use of an MCMC chain for the simulation within 
an ABC-MCMC (i.e. MCMC applied to an ABC ap¬ 


proximation of the posterior. Marjoram et al (2003)) 
or SL-MCMC (i.e. MCMC applied to an SL approx¬ 
imation) algorithm, as described in sections 


1.1.5 Although the approach of Girolami et al (2013) is 


1.1.4 


and 


exact, as they note it is significantly more computation¬ 
ally expensive than this approximate approach. For this 
reason, we do not pursue Russian Roulette approaches 
further in this paper. 

When a rejection sampler is available for simulating 


from f{-\0*), Rao et al (2013) introduce an alternative 


exact algorithm that has some favourable properties 
compared to the exchange algorithm. Since a rejection 
sampler is not available in many cases, we do not pursue 
this approach further. 

1 . 1.4 Approximate Bayesian computation 


Alquier et al 

(2015 

) However this chain Approximate Bayesian Gomputation ( 

Tavare et al 1997 


does not have the desired target as its invariant dis¬ 
tribution. Instead it can be seen as part of a wider 
class of algorithms that use a noisy estimate of the 
acceptance probability: noisy Monte Carlo algorithms 
(also referred to as “inexact approximations” in Giro- 


likelihood f{y\0) through the integral 
J{S{y)\0) oc [ n,{S{y)\S{u))f{u\0)du 


(5) 


lami et al (2013 )). [Alquier et al (2015) shows that under 
uniform ergodicity of the ideal chain, a bound on the 
expected difference between the noisy and true accep¬ 
tance probabilities can lead to bounds on the distance 
between the desired target distribution and the iterated 


where S{-) gives a vector of summary statistics and 
TTg {S{y)\S{u)) is the density of a symmetric kernel with 
bandwidth e, centered at S{u) and evaluated at S{y). 
As e —> 0, this distribution becomes more concentrated, 
so that in the case where S{-) gives sufficient statistics 
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for estimating 0, as e —> 0 the approximate posterior 
becomes closer to the true posterior. This approxima¬ 
tion is used within standard Monte Carlo methods for 
simulating from the posterior. For example, it may be 
used within an MCMC algorithm, where using an exact- 
approximation argument it can be seen that it is suf¬ 
ficient in the calculation of the acceptance probability 
to use the Monte Carlo approximation 


M 


( 6 ) 


for the likelihood at 6 * at each iteration, where 
/('l^*)- Whilst the exact-approximation 
argument means that there is no additional bias due 
to this Monte Carlo approximation, the approximation 
introduced through using a tolerance e > 0 or insuf¬ 
ficient summary statistics may be large. For this rea¬ 
son it might be considered a last resort to use ABC on 
likelihoods with an INC, but previous success on these 


models (e.g Grelaud et al (2009) and Everitt (2012)) 
lead us to consider them further in this paper. 


1.1.5 Synthetic likelihood 

ABC is essentially using, based on simulations from /, 
a nonparameteric estimator of fs (<S'|0), the distribu¬ 
tion of the summary statistics of the data given 6 . In 
some situations, a parametric model might be more ap¬ 
propriate. For example, if the statistic is the sum of 
independent random variables, a Central Limit Theo¬ 
rem (CLT) might imply that it would be appropriate 
to assume that fs (S'|0) is a multivariate Gaussian. 

The SL approach ( Wood|[2M0 ) proceeds by making 
exactly this Gaussian assumption and uses this approx¬ 
imate likelihood within an MCMC algorithm. The pa¬ 
rameters (the mean and variance) of this approximat¬ 
ing distribution for a given 9 are estimated based on the 
summary statistics of simulations ^ 

Concretely, the estimate of the likelihood is 


/SL iSiy)\e)=M(^S{yffio,Se 


where 


Me = 


M 


M 

E-5 («<”>) 


S, = 


ss 


m—1 


M- 1’ 


(7) 


with s = {S (mi) — fig,..., S (um) — Me) • Wood ( 2010| 
applies this method in a setting where the summary 
statistics are regression coefficients, motivated by their 
distribution being approximately normal. One of the 
approximations inherent in this method, as in ABC, 
is the use of summary statistics rather than the whole 


dataset. However, unlike ABC, there is no need to choose 
a bandwidth e: this approximation is replaced with that 
arising from the discrepancy between the normal ap¬ 
proximation and the exact distribution of the chosen 
summary statistic. The SL method remains approxi¬ 
mate even if the summary statistic distribution is Gaus¬ 
sian as /gL is not an unbiased estimate of the density 
and so the exact-approximation results do not apply. 
Rather, this is a special case of noisy MGMG, and we 
do not expect the additional bias introduced by esti¬ 
mating the parameters of /g^ to have large effects on 
the results, even if the parameters are estimated via an 
internal MGMG chain targeting f{-\6) as described in 
section 11.1.31 

SL is related to a number of other simulation based 
algorithms under the umbrella of Bayesian indirect in¬ 
ference ( Drovandi et al|[2015 |. This suggests a number 
of extensions to some of the methods presented in this 
paper that we do not explore here. 


1.2 Bayesian model comparison 

The main focus of this paper is estimating the marginal 
likelihood (also termed the evidence) 

p{y) = J pio)fiy\o)de 

and Bayes’ factors: ratios of evidences for different mod¬ 
els (Ml and M2, say), BF 12 = p{y\Mi)/p{y\M2). These 
quantities cannot usually be estimated reliably from 
MGMG output, and commonly used methods for es¬ 
timating them require f{y\9) to be tractable in 9. This 
leads Friel (2013) to label their estimation as “triply 


intractable” when / has an ING. To our knowledge the 
only published approach to estimating the evidence for 


such models is in Friel (2013), with this paper also giv¬ 


ing one of the only approaches to estimating BFs in 
this setting. For estimating BFs, ABG provides a vi¬ 


able alternative (Grelaud et al 2009), at least for models 


within the exponential family. 


Friel (2013) starts from Chib’s approximation. 


p{y) = 


f{y\9)p{9) 

^{9\y) 


( 8 ) 


where 9 can be an arbitrary value of 9 and tt is an 
approximation to the posterior distribution. Such an 
approximation is intractable when / has an INC. |Friel| 
(2013) devises a “population” version of the exchange 


algorithm that simulates points from the posterior 
distribution, and which also gives an estimate Z{9^p^) 
of the INC at each of these points. The points 9^p^ can 
be used to find a kernel density approximation tt, and 
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estimates of the INC. These are then used in 

a number of evaluations of (|^ at points (generated by 
the population exchange algorithm) in a region of high 
posterior density, which are then averaged to find an 
estimate of the evidence. This method has a number of 
useful properties (including that it may be a more ef¬ 
ficient approach for parameter inference than the stan¬ 
dard exchange algorithm), but for evidence estimation 
it suffers the limitation of using a kernel density esti¬ 
mate which means that, as noted in the paper, its use 
is limited to low-dimensional parameter spaces. 

In this paper we explore the alternative approach 
of methods based on IS, making use of the likelihood 
approximations described earlier in this section. These 
IS methods are outlined in section [21 In section [2] we 
note the good empirical performance of an inexact- 
approximate method and examine such approaches in 
more detail. As IS is itself not readily applicable to high 
dimensional parameter spaces, in section we look at 
natural extensions to the IS methods based on SMC. 
Particular care is required when considering approxi¬ 
mations within iterative algorithms: we provide a pre¬ 
liminary study of approximation in this context demon¬ 
strating theoretically that the resulting error can be 
controlled uniformly in time, under very favorable as¬ 
sumptions. This, and the associated empirical study 
are intended to provide motivation and proof of con¬ 
cept; caution is still required if approximation is used 
within such methods in practice but the results pre¬ 
sented suggest that further investigation is warranted. 
The algorithms presented later in the paper are viable 
alternatives to the MCMC approaches to parameter es¬ 
timation described in this section, and may outperform 
the corresponding MCMC approach in some cases. In 
particular they all automatically make use of a pop¬ 
ulation of points, an idea previously explored in the 
MCMC context by jCaimo and Friel (2011) and Friel 
(2013). In section]^ we draw conclusions. 


2 Importance sampling approaches 

In this section we investigate the use of IS for estimat¬ 
ing the evidence and BFs for models with INCs. We 
consider an “ideal” importance sampler that simulates 
P points fro™ ^ proposal q{-) and calculates 

their weight, in the presence of an INC, using 


~(p) ^ p(gfr))7(j/|^fr)) 

^ q{e(p))Z{0(p)) ’ 

with an estimate of the evidence given by 


p{y) = 


(p) 


P 


(9) 


( 10 ) 


To estimate a BF we simply take the ratio of estimates 
of the evidence for the two models under considera¬ 
tion. However, the presence of the INC in the weight 
expression in © means that importance samplers can¬ 
not be directly implemented for these models. To cir¬ 
cumvent this problem we will investigate the use of 
the techniques described in section o in importance 
sampling. We begin by looking at exact-approximation 
based methods in section [2T1 We then examine the use 
to approximate likelihoods based on simulation, includ¬ 
ing ABC and SL in section |2.2[ before looking at the 
performance of all of these methods on a toy example 
in section [2731 Finally, in sections [2.4| and [276[ we exam¬ 
ine applications to exponential random graph models 
(ERGMs) and Ising models, the latter of which leads 
us to consider the use of inexact-approximations in IS 


(first introduced in section 2.5). 


2.1 Auxiliary variable IS 

To avoid the evaluation of the INC in ([^, we propose 
the use of the auxiliary variable method used in the 
MCMC context in section imi Specifically, consider 
IS using the SAV target 

pi0,u\y) oc quiu\9,y)f{y\0)p{9), 

noting that it has the same normalizing constant as 
p{9\y) oc f{y\9)p{9), with proposal 

q{9,u) = f{u\9)q{9). 

This results in weights 

qu{u\9^p\yh{y\0^^^)p{0^^^)z{9^p^) 


7(u|6»fr))g'(0fr)) Z(6»(P)) 

7(y|6>fr))p(6lfr)) quiu\9^P\y) 

q{0{p)) 7(u|0frl) 


rXp) 


and the estimate ( [10[ ) of the evidence. 

In this method, which we will refer to as single auxil¬ 
iary variable IS (SAVIS), we may view 
qu{u\9^P\y)/^{u\9^P'>) as an unbiased importance sam¬ 
pling (IS) estimator of 1/Z{9^p^). Although we are us¬ 
ing an unbiased estimator of the weights in place of 
the ideal weights, the result is still an exact importance 
sampler. SAVIS is an exact-approximate IS method, as 
seen previously in Fearnhead et al ( 2010| , Chopin et al 
(2013) and Tran et al (2013). As in the MCMC setting, 


p=i 


to ensure the variance of estimators produced by this 
scheme is not large we must ensure the variance of esti¬ 
mator of 1/Z(9^P^) is small. Thus in practice we found 
extensions to this basic algorithm were useful: using 
multiple u importance points for each proposed 0^^ as 




































6 


Richard G. Everitt et al. 


in ([^; and using AIS, rather than simple IS, for esti¬ 
mating i/z{e^p'>) as in Q (giving an algorithm that we 
refer to as multiple auxiliary variable IS (MAVIS), in 
common with the terminology in Murray et al (|2006[)). 


Using qui-\0,y) = f{-\0), as described in section 
and 7 fc as in ([^, we obtain 


1 . 1.1 


1 1 


M K 

En 




Z{0) Z{9) M 7fe_i(M^™^J6»*,6»,y) 


( 11 ) 


In this case the (A)IS methods are being used as unbi¬ 
ased estimators of the ratio Z{9)IZ(9) and again SMC 
could be used in their place. 


2.2 Simulation based methods 


Didelot et al (2011) investigate the use of the ABC 


approximation when using IS for estimating marginal 
likelihoods. In this case the weight equation becomes 


= 


(7(6»(p)) 


where 


f{-\9‘'P'>), and using the notation 
from section [l.r.4[ However, using these weights within 
(10) gives an estimate for p{S{y)) rather than, as de¬ 


sired, an estimate of the evidence p{y). 

Fortunately, there are cases in which ABC may be 


used to estimate BFs. Didelot et al (2011) establish 


that, for the BF for two exponential family models: if 
Si{y) is sufficient for the parameters in model 1 and 
S 2 {y) is sufficient for the parameters in model 2, then 
using S{y) = (S'i(y), S' 2 (y)) gives 

p{y\Mi) ^ p{S{y)\Mi) 

Piy\M2) p{S{y)\M2)' 

Outside the exponential family, making an appropri- 


ate choice of summary statistics is harder ( 

Robert et al 

2011 

Prangle et al||2014 Marin et al||2014 

I- 


Just as in the parameter estimation case, the use 
of a tolerance e > 0 results in estimating an approxi¬ 
mation to the true BF. An alternative approximation, 
not previously used in model comparison, is to use SL 


(as described in section 1.1.5). In this case the weight 
equation becomes 


r;(p) — 


p{9^P^ )Af(^S{y ); pgip ), ) 


q{9(p)) 


where pg , Sg are given by ([^ . As in parameter esti¬ 
mation, this approximation is only appropriate if the 
normality assumption is reasonable. The choice of sum¬ 
mary statistics is as difficult as in the ABC case. 


2.3 Toy example 

In this section we have discussed three alternative meth¬ 
ods for estimating BFs: MAVIS, ABC and SL. To fur¬ 
ther understand their properties we now investigate the 
performance of each method on a toy example. 

Consider i.i.d. observations y = {yi\^Zi of a dis¬ 
crete random variable that takes values in N. For such 
a dataset, we will find the BF for the models 

1. y\9 ~ Poisson(0), 9 = X Exp(l) 

fi iy\^) = n ^/exp(-nA) 

Vi- 


i=l 


2. y\9 ~ Geometric(0), 9 = p ^ Unif(0,1) 

n 

f2{y\0) = l[{i-pr/p-^- 

i=l 

In both cases we have rewritten the likelihoods /i and 
/2 in the form j{y\9)/Z{9) in order to use MAVIS. Due 
to the use of conjugate priors the BF for these two mod¬ 


els can be found analytically. As in Didelot et al (2011) 


we simulated (using an approximate rejection sampling 
scheme) 1000 datasets for which roughly 

uniformly cover the interval [0.01,0.99], to ensure that 
testing is performed in a wide range of scenarios. For 
each algorithm we used the same computational effort, 
in terms of the number of simulations (100, 000) from 
the likelihood (such simulations dominate the compu¬ 
tational cost of all of the algorithms considered). 

Our results are shown in figure[^ with the algorithm- 
specific parameters being given in figure We note 
that we achieved better results for MAVIS when: de¬ 
voting more computational effort to the estimation of 
1/Z{9) (thus we used only 100 importance points in 
d-space, compared to 1000 for the other algorithms); 
and using more intermediate bridging distributions in 
the AIS, rather than multiple importance points (thus, 
in equation 0 we used K = 1000 and M = 1). In 
the ABC case we found that reducing e much further 
than 0.1 resulted in many importance points with zero 
weight (note that here, and throughout the paper we 
use the uniform kernel for tt^). From the box plots in 
figure |la[ we might infer that overall SL has outper¬ 
formed the other methods, but be concerned about the 
number of outliers. Figures [Tb| to [Tdj shed more light on 
the situations in which each algorithm performs well. 

In figurewe observe that the non-zero e results in 
a bias in the BF estimates (represented by the shallower 
slope in the estimated BFs compared to the true val¬ 
ues) . In this example we conclude that ABC has worked 
quite well, since the bias is only pronounced in situa¬ 
tions where the true BF favours one model strongly over 
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(a) A box plot of the log of the estimated BF divided by 
the true BF. 



True log{BF) 


(c) The log of the BF estimated by SL-IS against the log 
of the true BF. 

Fig. 1: Bayes’ factors for the 



True log{BF) 

(b) The log of the BF estimated by ABC-IS against the 
log of the true BF. 



True log(BF) 

(d) The log of the BF estimated by MAVIS against the 
log of the true BF. 

Poisson and geometric models. 


the other, and this conclusion would not be affected by 
the bias. For this reason it might be more relevant in 
this example to consider the deviations from the shallow 
slope, which are likely due to the Monte Carlo variance 
in the estimator (which becomes more pronounced as e 
is reduced). We see that the choice of e essentially gov¬ 
erns a bias-variance trade-off, and that the difficulty 
in using the approach more generally is that it is not 
easy to evaluate whether a choice of e that ensures a 
low variance also ensures that the bias is not signifi¬ 
cant in terms of affecting the conclusions that might be 
drawn from the estimated BF (see section |2.4 1. Figure 


[^suggests that SL has worked extremely well (in terms 
of having a low variance) for the most important situa¬ 
tions, where the BF is close to 1. However, we note that 
the large biases introduced due to the limitation of the 
Gaussian assumption when the BF is far from 1. Figure 
|ld| indicates that there is little or no bias when using 


MAVIS, but that there is appreciable variance (due to 
using IS on the relatively high-dimensional it-space). 

These results highlight that the three methods will 
be most effective in slightly different situations. The ap¬ 
proximations in ABC and SL introduce a bias, the effect 
of which might be difficult to assess. In ABC (assuming 
sufficient statistics) this bias can be reduced by an in¬ 
creased computational effort allowing a smaller e, how¬ 
ever it is essentially impossible to assess when this bias 
is “small enough”. SL is the simplest method to imple¬ 
ment, and seems to work well in a wide variety of situa¬ 
tions, but the advice in Wood ( 20I0| should be followed 
in checking that the assumption of normality is appro¬ 
priate. MAVIS is limited by the need to perform im¬ 
portance sampling on the high-dimensional (0, u) space 
but consequently avoids specifying summary statistics, 
its bias is small, and this method is able to estimate the 
evidence of individual models. 
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ABC (e = 0.1) 

ABC (e = 0.05) 

SL 

MAVIS 

p(y| Vfi) 

P{v\M2) 

4 

20 

40 

41 


Table 1: Model comparison results for Gamaneg data. 
Note that the ABC (e = 0.05) estimate was based 
upon just 5 sample points of non-zero weight. MAVIS 
also provides estimates of the individual evidence 
(log[p(y|Mi)] = -69.6, log[p(y|M 2 )] = -73.3). 


2.4 Application to social networks 


In this section we use our methods to compare the evi¬ 
dence for two alternative ERGMs for the Gamaneg data 


previously analysed in Friel (2013) (who illustrate the 


data in their figure 3). An ERGM has the general form 
fiy\^) = exp {0'^S{y)) , 

where S{y) is a vector of statistics of a network y and 
0 is a parameter vector of the same length. We take 
S{y) = (# of edges ) in model 1 and S{y) = (# of 
edges, # of two stars) in model 2 . As in [Friel ( 2013| ) 
we use the prior p{9) = Af{9; 0, 25/). 

Using a computational budget of 10^ simulations 
from the likelihood (each simulation consisting of an 
internal MCMC run of length 1000 as a proxy for an ex¬ 


act sampler, as described in section 1.1.3), Friel (2013) 


finds that the evidence for model 1 is ~ 37x that for 
model 2. Using the same computational budget for our 
methods, consisting of 1000 importance points (with 
100 simulations from the likelihood for each point), we 
obtained the results shown in Table [H 

This example highlights the issue with the bias- 
variance trade-off in ABC, with e = 0.1 having too large 
a bias and e = 0.05 having too large a variance. SL per¬ 
forms well — in this particular case the Gaussian as¬ 
sumption appears to be appropriate. One might expect 
this, since the statistics are sums of random variables. 
However, we note that this is not usually the case for 
ERGMs, particularly when modelling large networks, 
and that SL is a much more appropriate method for 


inference in the ERGMs with local dependence (Sch- 


weinberger and Handcock||2015 ). A more sophisticated 
ABC approach might exhibit improved performance, 
possibly outperforming SL. However, the appeal of SL 
is in its simplicity, and we find it to be a useful method 
for obtaining good results with minimal tuning. 


2.5 IS with biased weights 


1 . An internal MCMC chain was used in place of an 
exact sampler; 

2 . The 1/Z{9) term in ( [TT] ) was estimated before run¬ 
ning this algorithm (by using a standard SMC method, 
with initial distribution being the Bernoulli random 
graph (which can be simulated from exactly) and 
hnal distribution oc 7 (j 0 ) to estimate Z{9) (being 
the normalising constant of 7 ), and taking the recip¬ 
rocal) with this fixed estimate being used through¬ 
out. 


However, in practice, we tend to find that such “inexact- 
approximations” do not introduce large errors into 
Bayes’ factor estimates, particularly when compared to 
standard implementations of ABC (as seen in the pre¬ 
vious section). 

This example suggests that in practice it may some¬ 
times be advantageous to use biased rather than un¬ 
biased estimates of importance weights within a ran¬ 
dom weight IS algorithm: an observation that is some¬ 
what analogous to that made in Alquier et al (2015) in 
the context of MCMC. This section provides an initial 
theoretical exploration as to whether this might be a 
useful strategy in IS. 

In order to analyse the behaviour of importance 
sampling with biased weights, we consider biased es¬ 
timates of the weights in equation (10). Let 


w{9) := 


pi&h{y\9) 

Zi9)qi9) • 


We consider biased randomised weights that admit an 
additive decomposition, 


w{9) := w{9) -(- b{9) + Vg, 


in which b{9) = E['u;(0)|0] — 1 (;( 0 ) is a deterministic func¬ 
tion describing the bias of the weights and Vg is a ran¬ 
dom variable (more precisely, there is an independent 
copy of such a random variable associated with every 
particle), which conditional upon 9 is of mean zero and 
variance cr^ = Var(zl;(0)|0). This decomposition will not 
generally be available in practice, but is flexible enough 
to allow the formal description of many settings of in¬ 
terest. For instance, one might consider the algorithms 
presented here by setting b{9) to the (conditional) ex¬ 
pected value of the difference between the approximate 
and exact weights and Vg to the difference between the 
approximate weights and their expected value. 

We have immediately that the bias of such an es¬ 
timate is, using a subscript of q to denote expectations 
and variances with respect to q{9), Eg[6(0)]. By a simple 
application of the law of total variance, its variance is 


The implementation of MAVIS in the previous section ^ 2 

is not an exact-approximate method for two reasons: pVar,j(d;( 0 )) = — {Var^ H0) + 6(0)]+E, [bl]} 
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Consequently, the mean squared error of this estimate 
is: 

1 {Var, [ta(0) + 6(0)] + EJa^]} + E,[6(0)]2. 

If we compare such a biased estimator with a second es¬ 
timator in which we use the same proposal distribution 
but instead use an unbiased random weight 

w{9) := w{9) + V{9), 

where V(9) has conditional expectation zero and vari¬ 
ance (Jg, then it’s clear that the biased estimator has 
smaller mean squared error for small enough samples if 
it has sufficiently smaller variance, i.e., when (assum¬ 
ing Eg[6(0)]^ > 0, otherwise one estimator dominates 
the other for all sample sizes): 

i {Varg [a;(0) + 6(0)] + Egla^]} + Eg[6(0)]2 

{Varg [w{9)]+^q[&l]] 

which holds when P is inferior to 

- Varg [6(0)] - 2CoVg [w(0), 6(0)] 

Eg[6(0)]2 

In the artificially simple setting in which 6(0) = 6o 
is constant, this would mean that the biased estim¬ 
ator would have smaller MSE for samples smaller than 
the ratio of the difference in variance to the square 
of that bias suggesting that qualitatively a biased es¬ 
timator might be better if the square of the average 
bias is small in comparison to the variance reduction 
that it provides. Given a family of increasingly expens¬ 
ive biased estimators with progressively smaller bias, 
one could envisage using such an argument to manage 
the trade-off between less biased estimators and larger 
sample sizes. In practice a negative covariance between 
6(0) and w(9) might also lead to favourable perform¬ 
ance by biased estimators. 


2.6 Applications to Ising models 


In the current section we investigate this type of ap¬ 
proach further empirically, estimating Bayes’ factors 
from data simulated from Ising models. In particular 


we reanalyse the data from Friel (20131, which consists 


of 20 realisations from a first-order 10 x 10 Ising model 
and 20 realisations from a second-order 10 x 10 Ising 


model for which accurate estimates (via Friel and Rue 


(20071) of the evidence serve as a ground truth for com¬ 


parison. We also analyse data from a 100 x 100 Ising 
model. 


2.6.1 10 X 10 Ising models 


As in the toy example, we examine several different con¬ 
figurations of the IS and AIS estimators of the Z{9)/Z (0) 
term in the weight (§, using different values of M, K 
and B, the burn in of the internal MCMC, that yield 
the same computational cost (in terms of the number 
of Gibbs sweeps used to simulate from the likelihood). 
Note that for small values of B these estimators are 
biased; a bias that decreases as B increases. 


The empirical results in Friel (2013), use a total 
2 X 10^ Gibbs sweeps to estimate one Bayes’ factor, 
to allow comparison of our results with those in that 
paper. Here, estimating a marginal likelihood is done 
in three stages: firstly 0 is estimated; followed by Z{9), 
then finally the marginal likelihood. We took 0 to be 
the posterior expectation, estimated from a run of the 
exchange algorithm of 10,000 iterations. Z{9) was then 
estimated using SMG with an MGMG move, with 200 
particles and 100 targets, with the ith target being 
7i(-|0) = 7 i (- 110 / 100 ), employing stratified resampling 
when the effective sample size (ESS; [i^ng et al (1994)) 
falls below 100. The total cost of these three stages is 
5 X 10® Gibbs sweeps (1/4 of the cost of population ex¬ 
change) with the final IS stage costing 2 x 10^ sweeps 
(1/1000 of the cost of population exchange). We note 
that the cost of the first two stages has been chosen 
conservatively - less computational effort here can also 
yield good results. The importance proposal used in 
all cases was a multivariate normal distribution, with 
mean and variance taken to be the sample mean and 
variance from the initial run of the exchange algorithm. 
This proposal would clearly not be appropriate in high 
dimensions, but is reasonable for the low dimensional 
parameter spaces considered here. Figure shows the 
results produced by these methods in comparison with 
those from Friel (2013). 


We observe: improvements of the new methods over 
population exchange; an overall robustness of the new 
methods to different choices of parameters; and that 
there is a bias-variance tradeoff in the “internal” estim¬ 
ate of Z{9)/Z{9) in terms of producing the best beha¬ 
viour of the Bayes’ factor estimates. Recall that as B 
increases the bias of the internal estimate (the results of 
which can be observed in the results when using B = 0) 
decreases, but for a fixed computational effort it is bene¬ 
ficial to use a lower B and to instead increase M, using 
more importance points to decrease the variance. As in 
), we observe that it may be useful 
the exact-approximate approaches, 
and in this case, to simply use the best available es¬ 
timator of Z{9)IZ(9) (taking into account its statist¬ 
ical and computational efficiency) regardless of whether 


Alquier et al (2015 


to move away from 
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Figure 2: Box plots of the results of population exchange, SAVIS, and MAVIS on the Ising data. 


it is unbiased. In this example there is little observed 
difference in using our fixed computational budget on 
more AIS moves (AT) in place of using more importance 
points (M). In general we might expect using more AIS 
moves to be more productive when the estimates of the 
Z{d)IZ{9) for 9 far from 9 are required. 


the need to find good estimates of Z{9)/Z{9) that are 
here normalising constants of distributions on a space 
of higher dimensions. However, since the posterior has 
lower variance in this case, only values of 9 close to 9 
are proposed, which makes estimating Z{9)/Z{9) much 
easier, yielding the good results in this section. 


2.6.2 100 X 100 Ising model 

In this section we use SAVIS for estimating the mar¬ 
ginal likelihood for a first order Ising model on data 
of size 100 X 100 pixels simulated from an Ising model 
with parameter 9 = 10. Again, estimating a marginal 
likelihood is done in three stages: firstly 9 is estimated; 
followed by Z{9), then finally the marginal likelihood. 
The methods use for the first two stages are identical 
to those used in section |2.6.H as is the choice of pro¬ 
posal distribution. The third stage is performed using 
SAVIS with M = 100 and B = 20. From 20 runs of this 
third stage, a five-number summary of the log evid¬ 
ence estimates was (-5790.251, -5790.178, -5790.144, - 
5790.119, -5790.009), with the average ESS being 80.75. 
Note the low variance over these runs of the algorithm 
and the high ESS, which were also found for different 
configurations of the algorithm (including for more im¬ 
portance points and larger values of M and B). One 
might expect this example to be more difficult than the 
10 X 10 grids considered in the previous section, due to 


2.7 Discussion 


In this section we have compared the use of ABC-IS, 
SL-IS, MAVIS (and alternatives) for estimating mar¬ 
ginal likelihoods and Bayes’ factors. The use of ABC 
for model comparison has received much attention, with 
much of the discussion centring around appropriate 
choices of summary statistics. We have avoided this in 
our examples by using exponential family models, but 
in general this remains an issue affecting both ABC and 
SL. It is the use of summary statistics that makes ABC 
and SL unable to provide evidence estimates. How¬ 
ever, it is the use of summary statistics, usually es¬ 
sential in these settings, that provides ABC and SL 
with an advantage over MAVIS, in which importance 
sampling must be performed over the high dimensional 
data-space. Despite this disadvantage, MAVIS avoids 
the approximations made in the simulation based meth¬ 
ods (illustrated in figures lb to Id with the accuracy 


depending primarily on the quality of the estimate of 
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XjZ used). In section 2.6 we saw that there can be ad¬ 
vantages of using biased, but lower variance estimates 
in place of standard IS. 

The main weakness of all of the methods described 
in this section is that they are all based on standard IS 
and are thus not practical for use when 9 is high dimen¬ 
sional. In the next section we examine the use of SMC 
samplers as an extension to IS for use on triply intract¬ 
able problems, and in this framework discuss further 
the effect of inexact approximations. 


3 Sequential Monte Carlo approaches 


SMC samplers (Del Moral et al 20061 are a general¬ 


isation of IS, in which the problem of choosing an ap¬ 
propriate proposal distribution in IS is avoided by per¬ 
forming IS sequentially on a sequence of target distri¬ 
butions, starting at a target that is easy to simulate 
from, and ending at the target of interest. In standard 
IS the number of Monte Carlo points required in order 
to obtain a particular accuracy increases exponentially 
with the dimension of the space, but |Beskos et al ( 2011| 
show (under appropriate regularity conditions) that the 
use of SMC circumvents this problem and can thus be 
practically useful in high dimensions. 

In this section we introduce SMC algorithms for 
simulating from doubly intractable posteriors which have 
the by-product that, like IS, they also produce estim¬ 
ates of marginal likelihoods. We note that, although 
here we focus on estimating the evidence, the SMC sam¬ 
pler approaches based here are a natural alternative to 
the MCMC methods described in section fLlI and inher¬ 
ently use a “population” of Monte Carlo points (shown 
to be beneficial on these models by |Caimo and Friel] 
(2011)). In section 3.1 we describe these algorithms. 


before examining an application to estimating the pre¬ 
cision matrix of a Gaussian distribution in high dimen¬ 
sions in section |3.2[ In |3.4| we provide a preliminary in¬ 
vestigation of the consequences of using biased weight 
estimates in an SMC framework. 


3.1 SMC samplers in the presence of an INC 

This section introduces two alternative SMC samplers 
for use on doubly intractable target distributions. The 
first, marginal SMC, directly follows from the IS meth¬ 
ods in the previous section. The second, SMC-MCMC, 
requires a slightly different approach, but is more com¬ 
putationally efficient. Finally we briefly discuss 
simulation-based SMC samplers in section [3.1.2| 

To begin, we introduce notation that is common to 
all algorithms that we discuss. SMC samplers perform 


sequential IS using P “particles” 9^p\ each having (nor¬ 
malised) weight w^'P\ using a sequence of targets ttq to 
ttt, with TTx being the distribution of interest, in our 
case 7r(0|j/) oc p{9)f{y\d). In this section we will take 
TTt{9\y) oc p{9)ftly\9) = p{9)'yt{y\0)/Zt{9). At target t, 
a “forward” kernel Kt{-\9[^i) is used to move particle 
to with each particle then being reweighted 
to give unnormalised weight 


Ap) 




‘ p(0S)7*-i(y|0a) zM^^) ' 

Here, Lt-i represents a “backward” kernel that we chose 
differently in the alternative algorithms below. We note 
the presence of the INC, which means that this al¬ 
gorithm cannot be implemented in practice in its cur¬ 
rent form. The weights are then normalised to give 
I, and a resampling step is carried out. In the fol¬ 
lowing sections the focus is on the reweighting step: this 
is the main difference between the different algorithms. 
For more detail on these methods, see [Del Moral et~al| 

(1M7|. 


Zhou et al (2015) describe how BFs can be estim¬ 


ated directly by SMC samplers, simply by taking tti to 
be one model and ttt to be the other (with the being 
intermediate distributions). This idea is also explored 
for Gibbs random fields in |Friel| (2013). However, the 


empirical results in Zhou et al (2015) suggest that in 


some cases this method does not necessarily perform 
better than estimating marginal likelihoods for the two 
models separately and taking the ratio of the estimates. 
Here we do not investigate these algorithms further, but 
note that they offer an alternative to estimating the 
marginal likelihood separately. 


3.1.1 Random weight SMC Samplers 


SMC with an MCMC kernel Suppose we were able to 
use a reversible MCMC kernel Kt with invariant dis¬ 
tribution Trt{6\y) cx pi9)ftiy\9), and choose the Lt-i 
kernel to be the time reversal of Kt with respect to 
its invariant distribution, we obtain the following in¬ 
cremental weight: 

7t(2/|0S) 


r,ip) 


7t-i(2/|0S) Ztiei^J,) 


( 12 ) 


Once again, we cannot evaluate this incremental weight 
due to the presence of a ratio of normalising constants. 
Also, such an MCMC kernel cannot generally be dir¬ 
ectly constructed — the MH update itself involves eval¬ 
uating the ratio of intractable normalising constants. 
However, appendix shows that precisely the same 
weight update results when using either SAV or ex¬ 
change MCMC moves in place of a direct MCMC step. 
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In order that this approach may be implemented 
we might consider, in the spirit of the approximations 
suggested in section using an estimate of the ratio 
term For example, an unbiased IS 

estimate is given by 


Analogous to the SAV method, a sensible choice for 
qw{w) might be to use / , where w is on the same 

space as N/T data points. The normalising constant 
for this distribution needs to be known to calculate the 


importance weight in (19) so, as earlier, we advocate 


M 








(13) 


where ~ Although this estimate is un¬ 

biased, we note that the resultant algorithm does not 
have precisely the same extended space interpretation 
as the methods in Del Moral et al ( 2006[ ). Appendix [B] 
gives an explicit construction for this case, which incor¬ 
porates a pseudomarginal-type construction ( |Andrieu| 
and Roberts||2009 ). 


Data point tempering For the SMC approach to be effi¬ 
cient we require that the sequence of distributions {nt} 
be chosen such that ttq is easy to simulate from, ttt is 
the target of interest and the intermediate distributions 
provide a “route” between them. For the applications 
in this paper we found the data tempering approach of 


estimating this in advance of running the SMC sampler 
(aside from when the data points are added one at a 
time - in this case the normalising constant may usu¬ 
ally be found analytically). Note that if y does not con¬ 
sist of i.i.d. points, it is useful to choose the order in 
which data points are added such that the same 
(each with the same normalising constant) can be used 
in every weight update. For example, in an Ising model, 
the requirement would be to add the same shape grid 
of variables at each target. 

Marginal SMC An alternative method commonly used 
in ABC applications arises from the use of an approx¬ 


imation to the optimal backward kernel (Peters 2005 


Klaas et~^|2005|). In this case the weight update is 


n(P) - 


p{9nit{y\en 




(17) 


Chopin (20021 to be particularly useful. Suppose that 
the data y consists of N points, and that N is ex¬ 
actly divisible by T for ease of exposition. We then take 
= P{S) and for t = 1, ...T Trt{9\y) = p{9)ft{y\9) 

with 


MvW) = f {yi-.Nt/T\d) , 


(14) 


i.e. essentially we incorporate N/T additional data points 
for each increment of t. On this sequence of targets we 
then propose to use the SMC sampler with an MCMC 
kernel as described in the previous section. The only 

slightly non-standard point is the estimation of Zt-i{9[^^)/ 

Zt{9l^i), since in this case ^t_i(d|!(\) and Zt{9^^\) are 
the normalising constants of distributions on different 
spaces. We use 


for an arbitrary forward kernel Kf. This results in a 
computational complexity of 0{P^) compared to 0{P) 
for a standard SMC method, but we include it here 
in order to note that the 1/Z{-) term in could be 
dealt with in the same way as in the simple IS case. 
Considering the SAVM posterior, where in target t we 
use the distribution for the auxiliary variable Ut, and 
the SAVM proposal, where 
at the weight update: 


(p) 


/t(-|0(^^) we arrive 


wY’ 


Qu I O'f ’, y)p{9[^^ )7t (y I ) 


i(p)', 




(r) 


o(p)in(r) 


Zt-i(9E\) 

Zt(0Ei) 


1 ^ 

= -y 

M 




in which normalising constant appears in this weight 
update. We include this approach for completeness but 
do not investigate it further in this paper. 

3.1.2 Simulation-based SMC samplers 


'yt{u 


(■m,p) ifl(p) 




(15) 


where ~ /t(-|^t((.\) and and are 

subvectors of is in the space of the ad¬ 

ditional variables added when moving from ft-i to ft 
(providing the argument in an arbitrary auxiliary dis¬ 
tribution qwi’)) and y^’^^ is in the space of the existing 
variables. For t = 1 this becomes 


Section |2.2| describes how the ABC and SL approxim¬ 
ations may be used within IS. The same approximate 


likelihoods may be used in SMC. In ABC (Sisson et al 
2007|), where the sequence of targets is chosen to be 


TTt{9) (X p{9)fet{y\9) with a decreasing sequence et, this 
idea provides a useful alternative to MCMC for explor¬ 
ing ABC posterior distributions, whilst also providing 
estimates of Bayes’ factors (Didelot et al 2011). The 


1 


zMy 


M 




=1 7i(i 


1C) 


(16) 


with m)' 


{m,p) 




ip)\ 


use of SMC with SL does not appear to have been ex¬ 
plored previously. One might expect SMC to be use¬ 
ful in this context (using, for example, the sequence of 
targets TTt{9) oc p{9)f^^'^ {S{y)\9)), particularly when 
/SL is concentrated relative to the prior. 
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3.2 Application to precision matrices 

In this section we examine the performance of the SMC 
sampler, with MCMC proposal and data-tempered tar¬ 
get distributions, for estimating the evidence in an ex¬ 
ample in which 9 is of moderately high dimension. We 
consider the case in which 9 = is an unknown 

precision matrix, f{y\9) is the d-dimensional multivari¬ 
ate Gaussian distribution with zero mean and p{9) is 
a Wishart distribution V) with parameters v > d 
and V € Suppose we observe n i.i.d. observations 

2/ = {diliLi’ where yi G The true evidence can be 
calculated analytically, and is given by 


Systematic resampling was performed when the effec¬ 
tive sample size (ESS) fell below P/2. 

We estimated the evidence 10 times using the SMC 
sampler and compared the statistical properties of each 
algorithm using these estimates. For our simulated data, 
the log of the true evidence was —89.43. Over the 10 
runs of the SMC sampler a five-number summary of the 
log evidence estimates was (—90.01, —89.51, —89.35, 
-88.92, -88.37). 


3.3 Application to Ising models 


p(y) = 


1 


(V ^ + ELiyi9n' 


Yndj2 


U'i) 




(18) 


where denotes the d-dimensional gamma function. 
For ease of implementation, we parametrise the preci¬ 
sion using a Cholesky decomposition = LL' with 
L a lower triangular matrix whose (j,j)’th element is 
denoted a,- 




As in section 2.3 we write f{y\9) as ^{y\9)/Z{9) as 
follows 


/ ({2/J”=i I = |2^r| 


|-n/2 


/I 
exp 

\ i=l 


-1 


yi 


where in some of the experiments that follow, Z{9) = 
|27rA’|"^^ is treated as if it is an INC. In the Wishart 
prior, we take v = 10 + d and V = Id- 

Taking d = 10, n = 30 points were simulated using 
yi ~ AlVAf (Od, 0.1 X Id). The parameter space is thus 
55-dimensional, motivating the use of an SMC sampler 
in place of IS or the population exchange method, nei¬ 
ther of which are suited to this problem. In the SMC 
sampler, in which we used P = 10, 000 particles, the 
sequence of targets is given by data point tempering. 
Specifically, the sequence of targets is to use p{E~^) 
when t = 0 andp{E~^)f | for t = 1,..., T 

(with T = n). The parameters are {oy | 1 < J < * < d}. 
We use single component MH kernels to update each of 
the parameters, with one (deterministic) sweep consist¬ 
ing of an update of each in turn. For each we use 
a Gaussian random walk proposal, where at target t, 
the variance for the proposal used for a^- is taken to 
be the sample variance of a^- at target t — 1. For up¬ 
dating the weights of each particle we used equation 
i where we chose qw{ ) = / ^' I with the 

maximum likelihood estimate of the precision , and 
chose M = 200 “internal” importance sampling points. 


In this section we apply the random weight SMC sam¬ 
pler to the Ising model data considered in section [2.6.1| 
We use SMC to estimate the marginal likelihood of both 
the first and second order Ising models, then take the 
ratio of these estimates to estimate the Bayes’ factor. 
Note that in this case the size of the parameter space is 
much smaller than in the precision example, with the 
models having parameter spaces of sizes 1 and 2 respec¬ 
tively. The excellent results achieved by IS in section 
|2.6.1| might seem to imply that SMC samplers are not 
required for this problem, but recall that we required 
preliminary runs of the exchange algorithm in order to 
design an appropriate importance proposal, along with 
an SMC sampler in order to estimate the normalising 
constant Z{9) of the distribution used for the aux¬ 
iliary variables An SMC sampler offers a cleaner 

approach that requires less user tuning. 

We applied the random weight SMC sampler de¬ 
scribed in section |3.1.H with 500 particles, data point 
tempering (adding one pixel at a time, taking q-^i to be 
Bern(0.5)), and using the estimate of the ratio of nor¬ 
malising constants in the weight update from equation 


(15) with M = 20 importance points. Each of these esti¬ 
mates requires simulating a single point from 7 t(-| 0 (!)\) 
using a Gibbs sampler, which had a burn in of B = 10 
iterations, yielding a total computational budget of 200 
Gibbs sweeps for estimating the ratio of normalising 
constants. Note that, as considered in section |2.6.1[ 
this use of a Gibbs sampler results in an inexact al¬ 
gorithm, but this level of burn in was found to be suffi¬ 
cient for this bias to be minimal in the random weight 
IS algorithms. The MCMG kernel of the exchange algo¬ 
rithm was used (with proposal taken to be the sample 
variance of the set of particles at each SMG iteration), 
using the approximate version where a Gibbs sampler 
with burn in B = 10 iterations is used to simulate from 
7 j(-| 0(*)). The total cost of this algorithm is comparable 
to the IS approaches in section [2. 6. 1[ with a total cost 
of 5.25 X 10® Gibbs sweeps and hence around a quarter 
of that of the algorithm of Friel (2013[). Figure]^ shows 
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Fig. 3: Box plots of the results of population exchange 
and random weight SMC. 


the results produced by this method in comparison with 


those from Friel (2013). 


We observe that the median of the random weight 
SMC estimates is more accurate than that of the popu¬ 
lation exchange estimates - the bias introduced through 
using an internal Gibbs sampler in place of an exact 
sampler does not appear to accumulate sufficiently to 
affect the results (this issue is explored further in the 
following section). However, it has slightly higher vari¬ 
ance than population exchange (much higher than SAVIS 
and MAVIS). This high variance can be attributed to 
two factors: 

1. Since the SMC sampler begins with points sampled 
from the prior, larger changes in 9 are considered 
than in the IS approaches, thus the estimates of the 
ratio of the normalising constants require more im¬ 
portance points to be accurate - the results suggest 
that the budget of 200 Gibbs sweeps is insufficient. 
This is the opposite situation to that encountered 
in section |2.6.2[ where the changes in 9 are small 
and the estimates of the ratio of the normalising 
constants are accurate with small numbers of im¬ 
portance points. 


2. It’s been frequently observed (cf. Lee and Whiteley 


(2015)) that, as suggested by the asymptotic vari¬ 
ance expansion, in some instances the first few iter¬ 
ations of an SMC sampler contribute substantially 
to the ultimate error. This issue arises since the for¬ 
getting of the sampler doesn’t suppress the terms 
that the initial errors contribute to the asymptotic 
variance enough to compensate for the fact that 
they’re much larger than the final ones. This is due, 
when using data point tempering in the manner we 


have here, to the much larger relative discrepancy 
between the first few distributions in the sequence 
than between later distributions. 

We conclude that the random weight SMC method 
is a viable approach to estimating Bayes’ factors for 
these models, but that care should be taken in tuning 
the weight estimates and choosing the sequence of SMC 
distributions. 


3.4 Biased Weights in SMC 
3.4-1 Error bounds 

We now examine the effect of using inexact weights on 
estimates produced by SMC samplers. By way of theo¬ 
retical motivation of such an approach, we demonstrate 
that under strong, but standard (cf. [Del Moral (2004)), 
assumptions on the mixing of the sampler, if the ap¬ 
proximation error is sufficiently small, then this error 
can be controlled uniformly over the iterations of the 
algorithm and will not accumulate unboundedly over 
time (and that it can in principle be made arbitrarily 
small by making the relative bias small enough for the 
desired level of accuracy). We do not here consider the 
particle system itself, but rather the sequence of distri¬ 
butions which are being approximated by Monte Carlo 
in the approximate version of the algorithm and in 
the idealised algorithm being approximated. The Monte 
Carlo approximation of this sequence can then be un¬ 
derstood as a simple mean field approximation and its 


convergence has been well studied, see for example Del Moral 


(2004). 


In order to do this, we make a number of identifi¬ 
cations in order to allow the consideration of the ap¬ 
proximation in an abstract manner. We allow Gt to 
denote the incremental weight function at time t, and 
Gt to denote the exact weight function which it ap¬ 
proximates (any auxiliary random variables needed in 
order to obtain this approximation are simply added 
to the state space and their sampling distribution to 
the transition kernel). The transition kernel Mt com¬ 
bines the proposal distribution of the SMC algorithm 
together with the sampling distribution of any needed 
auxiliary variables. We allow x to denote the full col¬ 
lection of variables sampled during an iteration of the 
sampler, which is assumed to exist on the same space 
during each iteration of the sampler. 

We employ the following assumptions (we assume 
an infinite sequence of algorithm steps and associated 
target distributions, proposals and importance weights; 
naturally, in practice only a finite number would be em- 
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ployed but this formalism allows for a straightforward 
statement of the result): 

Al (Bounded Relative Approximation Error) There ex¬ 
ists 7 < oo such that: 


sup sup 

tGN X 


\Gt{x)^-Gt{x)\ 

Gt{x) 


< 7- 


A2 (Strong Mixing; slightly stronger than a global Doe- 
blin condition) There exists e(M) > 0 such that: 


. dMt{x, •) 

supmt ———-r 

tgN dMt(y, •) 


> e(M). 


A3 (Control of Potential) There exists e(G) > 0 such 
that: 


sup inf 


Gt{x) 


teN Gt{y) 


> <G). 


The first of these assumptions controls the error intro¬ 
duced by employing an inexact weighting function; the 
others ensure that the underlying dynamic system is 
sufficiently ergodic to forget it’s initial conditions and 
hence limit the accumulation of errors. We demonstrate 
below that the combination of these properties suffices 
to transfer that stability to the approximating system. 

We consider the behaviour of the distributions rip 
and rip which correspond to the target distributions 
at iteration p of the exact and approximating algo¬ 
rithms, prior to reweighting, at iteration p in the fol¬ 
lowing proposition, the proof of which is provided in 
Appendix which demonstrates that if the approxi¬ 
mation error, 7 , is sufficiently small then the accumu¬ 
lation of error over time is controlled: 


Proposition 1 (Uniform Bound on Total-Variation 
Discrepancy). If Al, A2 and AS hold then: 


sup\\r]n- rinWyy^ 
new 


47(1 - e(M)) 
e3(M)e(G) 


This result is not intended to do any more than 
demonstrate that, qualitatively, such forgetting can pre¬ 
vent the accumulation of error even in systems with “bi¬ 
ased” importance weighting potentials. In practice, one 
would wish to make use of more sophisticated ergod- 
icity results such as those of Whiteley (20131, within 
this framework to obtain results which are somewhat 
more broadly applicable: assumptions A 2 and A3 are 
very strong, and are used only because they allow sta¬ 
bility to be established simply. Although this result is, 
in isolation, too weak to justify the use of the approx¬ 
imation schemes introduced here in practice, together 
with the empirical results presented below, it does sug¬ 
gest that further investigation of such approximations 
is warranted particularly in settings in which unbiased 
estimators are not available. 


3.4-2 Empirical results 


We use the precision example introduced in section |3.2| 
to investigate the effect of using biased weights in SMC 
samplers. Specihcally we take d = 1 and use a sim¬ 
ulated dataset y where n = 5000 points were simu¬ 
lated using yi ^ N (0, 0.1). In this case there is only a 
single parameter to estimate, oi, and we examine the 
bias of estimates of the evidence using four alternative 
SMC samplers, each of which use a data-tempered se¬ 
quence of targets (adding one data point at each tar¬ 
get). For this data we can calculate analytically the 
true value of the marginal likelihood after receiving 
each data point, thus we can estimate the bias of each 
sampler at each iteration. The first SMC sampler (the 
“exact weight” sampler) is the method where the true 
value of is used in the weight up¬ 

date. The second is the same “unbiased random weight” 
sampler used in section |3.2[ which uses an unbiased 
IS weight estimate, here with M = 20 “internal” im¬ 
portance sampling points. The third, which we refer to 
as the “biased random weight” sampler, uses a biased 
bridge estimator instead, specifically we use in place of 

(TU) 




{m,p)x 









(m,p)N 

t,2 ) 



(19) 


where ~ ~ q^i-) so that 

= (»<"’■■>,»!”’«), and 4 “'-' ~ /,(.|»< 7 ) with 
and being the corresponding subvectors 

Motivated by the theoretical argument presented 
previously, we investigate the effect of improving the 
mixing of the kernel used within the SMC. In this model 
the exact posterior is available at each SMC target, so 
we may replace the use of an MCMC move to update 
the parameter with a direct simulation from the poste¬ 
rior. In this extreme case, there is no dependence be¬ 
tween each particle and its history; we refer to this, the 
fourth SMC sampler we consider, as “biased random 
weight with perfect mixing”. Each SMC sampler was 
run 20 times, using 50 particles. 

Figures and show the estimated bias and mean 
square error of the log evidence estimates of each sam¬ 
pler at each iteratiorQ No bias is observed in the al- 


3 We note that log of an unbiased estimate in fact produces 
a negatively-biased estimator but we observe, through the 
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Fig. 4: The estimated bias in the log evidence estimates 
of the true (black solid), unbiased random weight (black 
dashed), biased random weight (grey solid) SMC algo¬ 
rithms using MCMC kernels, and the estimated bias 
when using the biased random weight algorithm with 
perfect mixing (grey dashed). 

gorithm with true weights, and only a small bias is ob¬ 
served in the unbiased random weight sampler (this bias 
is likely to be due to the relatively small number of repli¬ 
cations). Bias does accumulate in the biased random 
weight sampler, but we note that the level of bias ap¬ 
pears to stabilise. This accumulation of bias means that 
one should exercise caution in the use of SMC samplers 
with biased weights. However, we observe that perfect 
mixing substantially decreases the bias in the evidence 
estimates from the algorithm. Also, in this case we ob¬ 
serve that the bias does not accumulate sufficiently to 
give poor estimates of the evidence. Here the standard 
deviation of the final log evidence estimate over the ran¬ 
dom weight SMC sampler runs is approximately 0.4, so 
the bias is not large by comparison. 

3.5 Discussion 

In section [2^ we observed clearly that the use of biased 
weights in IS can be useful for estimating the evidence 
in doubly intractable models, but we have not observed 
the same for SMC with biased weights. When applied 
to the precision example in section an inexact sam¬ 
pler (using the bridge estimator) did not outperform 
the exact sampler, despite the mean square error of the 

results for the exact algorithm indicate that the variance of 
the evidence estimates we use is sufficiently small that this 
effect is negligible. 



Fig. 5: The estimated MSE in the log evidence estimates 
of the four SMC samplers (same key as figure . 

biased bridge weight estimates being substantially im¬ 
proved compared to the unbiased IS estimate. Over 10 
runs the mean square error in the log evidence was 0.34 
for the inexact sampler, compared to 0.28 for the exact 
sampler. This experience suggests that samplers with 
biased weights should be used with caution: weight es¬ 
timates with low variance do not guarantee good per¬ 
formance due to the accumulation of bias in the SMC. 

However, the theoretical and empirical investigation 
in this section suggests that this idea is worth further 
investigation, possibly for situations involving some of 
the other intractable likelihoods listed in section □ Our 
results suggest that improved mixing can help combat 
the accumulation of bias, which may imply that there 
may be situations where it is useful to perform many 
iterations of a kernel at a particular target, rather than 
the more standard approach of using many intermedi¬ 
ate targets at each of which a single iteration of a kernel 
is used. Other variations are also possible, such as the 
calculation of fast cheap biased weights at each target 
in order only to adaptively decide when to resample, 
with more accurate weight estimates (to ensure accu¬ 
rate resampling and accurate estimates based on the 
particles) only calculated when the method chooses to 
resample. 

4 Conclusions 

This paper describes several IS and SMC approaches for 
estimating the evidence in models with INCs that out¬ 
perform previously described approaches. These meth¬ 
ods may also prove to be useful alternatives to MCMC 
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for parameter estimation. Several of the ideas in the 
paper are also applicable more generally, in particular 
the use of synthetic likelihood in the IS context and the 
notion of using biased weight estimates in IS and SMC. 
We note that the bias in these biased weight methods 
may be small compared to errors resulting from com¬ 
monly accepted approximate techniques such as ABC. 

For biased IS, in section 2.5 we show that the error of 
estimates from low-variance biased methods can be less 
than those from unbiased methods of higher variance. 
This is comparable to a result for biased MCMC meth¬ 
ods (Johndrow et al 2015), where it is shown that the 
error of estimates from a computationally cheap biased 
MCMC can be less than those from an expensive ex¬ 
act MCMC. In both cases, given a finite computational 
budget, it is not always the case that this budget should 
be spent on guaranteeing the exactness of the sampler 
if minimizing approximation error is the objective. 

A similar choice concerning the allocation of com¬ 
putational resources arises in SMC. Here, one does need 
to be especially careful about the use of biased SMC, 
due to the possible accumulation of bias over SMC it¬ 
erations. One might expect this accumulated bias to 
outweigh any benefits a reduced variance may bring. 
For this reason we advise caution in the use of biased 
SMC in general. This paper does, however, indicate 
that there may exist cases where biased SMC is useful, 
through: the theoretical result that under strong mix¬ 
ing conditions bias does not accumulate unboundedly; 
the empirical evidence that fast mixing may reduce the 
accumulation of bias; and the empirical results where 
we observe (in a situation where the distance between 
successive targets decreases) that the rate at which bias 
accumulates decreases with time. 
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A Using SAV and exchange MCMC within 
SMC 


A.l Weight update when using SAV-MCMC 

Let us consider the SAVM posterior, with K being the MCMC 
move used in SAVM. In this case the weight update is 

~(p) ^ }ft )<?n, y) 

p{0i22ft-2y\0[^\2^{'u-22i \, y) 

it{y\e[22 z,.^(0[22 


7t-i(y|0S) Zt(6222 

which is the same update as if we could use MCMC directly. 


A.2 Weight update when using the exchange algorithm 

[Nicholls et al| ( |2012| ) show the exchange algorithm, when set 
up to target 'itt(d\y) oc p{d) ft{y\d) in the manner described in 
section [TO] simulates a transition kernel that is in detailed 
balance with 'Kt{6\y). This follows from showing that it satis¬ 
fies a “very detailed balance” condition, which takes account 
of the auxiliary variable u. The result is that the derivation 
of the weight update follows exactly that of (|12[|. 


B An extended space construction for the 
random weight SMC method in section |3.1.1| 

The following extended space construction justifies the use 
of the “approximate” weights in ( |13| ) via an explicit sequen¬ 
tial importance (re)sampling argument along the lines of jDelj 
[Moral et alj | |2006[ ) , albeit with a slightly different sequence of 
target distributions. 

Consider an actual sequence of target distributions {7rt}t>o. 
Assume we seek to approximate a normalising constant dur¬ 
ing every iteration by introducing additional variables Ut = 
(uj,. .., u2) during iteration t > 0. 

Define the sequence of target distributions: 

{Xt = {0o,0l,Ui,. . .,0t,Ut)) 
t-i 

■.=nti0t) n Ls(0s+i,0sy 
s=0 

t M 

n M E fs-i{nT\e.-i) n M<\ 0 s-i) 

s = l m = l q^m 

where Ls has the same role and interpretation as it does in a 
standard SMC sampler. 

Assume that at iteration t the auxiliary variables 
are sampled independently (conditional upon the associated 
value of the parameter, 0t_i)and identically according to 
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ft{-\dt — i) and that Kt denotes the incremental proposal dis¬ 
tribution at iteration t, just as in a standard SMC sampler. 

In the absence of resampling, each particle has been sampled 
from the following proposal distribution at time t: 

t t M 

Mt(£t)=/ro(0o) n n n 

s=l s = l m = l 


and hence its importance weight, should be: 

ns =0 Ls(es + l,0a) 
t^oiSo) 0* = ! ds) 

n:=i ^ fs{uT\es-i)] 

Ul=rU^=rfsinT\Os-i) 

_ 7rt(^t)ns=0 Ls{0s + l,ds) -r-r /a - 1 | 1) 

/ro(6>o)n‘ = l M /a(u^|ea_l) 




T^t-i{St-i)Kt{6t-\, 6t) 


1 ft-i{uY^, Ot-i) 

M Mur\0t-i) 


which yields the natural sequential importance sampling in¬ 
terpretation. The validity of the incorporation of resampling 
follows by standard arguments. 

If one has that -Ktidt) oc pi6t)ft{y\0t) = p{dt)'rtiy\dt) /Zt{0t) 
and employs the time reversal of Kt for Lt — i then one arrives 
at an incremental importance weight, at time t of: 

p{0t)ft{y\Ot-i} 1 ^ 

P{0t-i)ft-i{y\0t-i) M ft{uY^\0t-\) 

p{0t)'yt{y\0t-i) 1 jt-i(u]^\0t-i) 

p{0t-i)jt-i{y\0t-i) M ■yt{uY^\0t-i} 

yielding the algorithm described in section 1133 as an exact 
SMC algorithm on the described extended space. 


C Proof of SMC Sampler Error Bound 


A little notation is required. We allow {E,£) to denote the 
common state space of the sampler during each iteration, 
Cb(E) the collection of continuous, bounded functions from E 
to R, and V{E) the collection of probability measures on this 
space. We define the Boltzmann-Gibbs operator associated 
with a potential function G : i? —>• (0, oo) as a mapping, 
'Pa '■ EiE) —^ V{E), weakly via the integrals of any function 

<P e Cb{E) 


J 


ip{x)pG{v){dx) 


J p{dx)G{x)ip{x) 
f p(dx')G(x') 


The integral of a set A under a probability measure 77 is 
written rj(A) and the expectation of a function of A ~ 77 
is written The supremum norm on Cb{E) is defined 

llv’lloo = the total variation distance on 

V{E) isll 7 r- 7 .ll TV = sup^( 7 .(A) —/r(A)). Markov kernels, M : 
E —7- V{E) induce two operators, one on integrable functions 
and the other on (probability) measures: 


V</7 eCb{E) : 


■= J M{-,dy)ip{y) 
(pM){-) ■.= j n{dx)M{x,-)- 


Having established this notation, we note that we have the 
following recursive definition of the distributions we consider: 

Vo =Vo =■ Mo Vt>i rit>b=pQ^^{vt-i) 

and for notational convenience define the transition operators 
as 


Ptivt-i) =pGt-i{Pt-i)Mt ’Ptivt-i) ^{T]t-\)Mt. 

We make use of the (nonlinear) dynamic semigroupoid, which 
we define recursively, via it’s action on a generic probability 
measure 77 , for t G N: 

=Pti'n) d’s,t = Pt{d>s,t-i{'n)) for s <t, 

with <l>t,t{v) = V and defined correspondingly. 

We begin with a lemma which allows us to control the 
discrepancy introduced by Bayesian updating of a measure 
with two different likelihood functions. 


Lemma 1 (Approximation Error) IfAl. holds, then\/rj E 
V{E) and any t G N; 

“ '^G,{n)\\TV < 27- 

Proof. Let At := Gt — Gt and consider a generic (p G Cb{E)\ 


- '^gAv)){p) 

_ v{Gt)v{Gtp) - v{Gt)v(Gtp) 
viGtMGt) 

_ viGt)v{{Gt + At)p) - viiGt + At))v(Gtp) 
v{Gt)v{Gt) 

v{Gt)v{Atip) - v(At)v{Gtp) 
v{Gt)v{Gt) 

Considering the absolute value of this discrepancy, mak¬ 
ing using of the triangle inequality: 


- '^GAn)){p)\ < 


v{Gt) 


viGt) 


pjGtp) 

n{Gt) 


Noting that Gt is strictly positive, we can bound | 77 (Gt( 7 j)|/ 77 (Gt) 
with 77 (Gt|i 77 |)/r 7 (Gt) and thus with Hv’lloo ^■PPiy ^ similar 
strategy to the first term: 




< 




ni^t) 

viGtM) 

viGt) 


ViGt) 

ViGt) 


< 27||v5|Ioo ■ 


(noting that 77 (|/lt|)/ 77 (Gt) < 7 by integration of both sides 
ofAI). □ 

We now demonstrate that, if the local approximation er¬ 
ror at each iteration of the algorithm(characterised by 7 ) is 
sufficiently small then it does not accumulate unboundedly 
as the algorithm progresses. 

Proof of Proposition Qj We begin with a telescopic decom¬ 
position (mirroring the strategy employed for analysing particle 
approximations of these systems in [Del Moral| ( [2004[ )): 

t 

Vt -rjt = '^ ^s-l,t(7?s-l) - ^s,t(7?s). 

S = 1 


V/i eV{E) : 
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We thus establish (noting that 770 = 770 ): 

t 

m - Vt = - <Ps,t{$s{ris-i))- (20) 

S = 1 

Turning our attention to an individual term in this ex¬ 
pansion, noting that: 


we have, by application of a standard Dobrushin contraction 
argument and Lemma ^ 


{•PsiVs-l) - 'PsiVs-l)){y^) 

(21) 

='^g,_i(?? 3 -i)Ms((p) - 


<f>s(p3-l) -^3(5?3 -i) 

(22) 

<(1 - e(M)) <f'G^_^(i7^_i) - Ip'g^_^(57s_i) 


<27(1 - e(M)) 

(23) 


which controls the error introduced instantaneously during 
each step. 

We now turn our attention to controlling the accumula¬ 
tion of error. We make use of ( [Del Moral||2004} Proposition 
4.3.6) which, under assumptions A2 and A3, allows us to de¬ 
duce that for any probability measures fi, v. 

i|^s,s+fc(7r) - ^s,s+fc(7/)||rpY < /3(0s,s+fe) ll/r - 

where 

P(‘^s,s+k) = - e^(M))'^. 

Returning to decomposition (( |20| ), applying the triangle 
inequality and this result, before finally inserting (( |23[ )) we 
arrive at: 


Algorithm 1 Random weight SMC sampler with 
MCMC move and data point tempering 


for p = 1 to P do 
Draw ~ p(-) 
for 777, = 1 to M do 

end for ^ 

Find the estimate - tjw using (|16|) 

^iC^o ) '—' 

Find incremental weight = 7i(?/|0o^^)-yjs- 

^i(^0 ) 

end for 

Resample the set of particles and set = 1/P. 
for t = 1 to T do 
for p = 1 to P do 

for 771 = 1 to M do 

end for 


Find the estimate 


z.(eS) 


using 


Calculate wl = 




end for 

Resample the set of particles and set = 1/P 
for p = 1 to P do 

Draw ~ where K is an MCMC kernel 

end for 
end for 


ht 


S = 1 

2(l-e^(M)) 

“.=1 

2(l-e^(M)) 


i))-<2>«,t(<?. ( 77 .- 1 )) 


TV 


^.(i?.-i) - ‘^siVs-lh 


■27(l-e(M)) 


TV 


47(1 - e(M)) 
<M)e(G) 


^(1 -c2 (M))*-« 

S = 1 


This is trivially bounded over all t by the geometric series 
and a little rearrangement yields the result: 


47(1 - e(M)) 
<M)e{G) 


3 = 0 


47(1 - e(M)) 
£3(M)£(G) 


□ 


D Pseudo code for random weight SMC 
sampler 

This appendix contains the simplest form of the random weight 
SMC sampler used in the data point tempering examples in 
section]^ in which resampling is performed at every step. Es¬ 
sentially, any standard improvements to SMC algorithms can 
be applied. 



















